2023-11-23 Rounds evaluated with zero scores

Incident summary

Several SPARK rounds were incorrectly evaluated with empty scores (all times in UTC):

  • Wed, Nov 22nd: 18:52, 19:53, …, 23:57
  • Thu, Nov 23rd: 00:58, …, 04:01, 06:04, …, 10:08

While we identified the problem early, we could not troubleshoot it quickly because we reached the ingestion limit of our log monitoring service and had to wait for the next round boundary to watch live logs.

Once we got the logs, we quickly identified the problem as an out-of-memory error and fixed it by increasing the memory size available to our spark-evaluate service.

Impact

Station operators received no rewards for 15 rounds.

Corrective actions

  • We will implement a new alert to inform us when a round is evaluated with fewer than ten scoring participants (spark-evaluate#69).
  • We will implement alerts for the situation when any of our Node.js services crashes hard (e.g. on an out-of-memory error).
  • In the longer term, we want to refactor our evaluation service to handle the increasing load of measurements in a constant memory space (spark-evaluate#5).
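
As an illustration of the first corrective action, the check behind such an alert could look roughly like the sketch below. It is not the actual spark-evaluate code. The function names, the shape of the scores object (participant address mapped to a numeric score), and the returned alert object are all assumptions made for this example. Only the threshold of ten comes from the action item above.

```javascript
// Threshold from the corrective action above; the constant name is hypothetical.
const MIN_SCORING_PARTICIPANTS = 10

// Count participants whose score is greater than zero.
// `scores` is an assumed shape: { [participantAddress]: score }.
function countScoringParticipants (scores) {
  let count = 0
  for (const score of Object.values(scores)) {
    if (score > 0) count++
  }
  return count
}

// Decide whether a round should raise an alert.
// Returns a plain object so the caller can forward the message
// to whatever alerting channel is in use (not shown here).
function checkRound (roundIndex, scores, minParticipants = MIN_SCORING_PARTICIPANTS) {
  const scoring = countScoringParticipants(scores)
  if (scoring < minParticipants) {
    return {
      alert: true,
      message: `Round ${roundIndex} was evaluated with only ${scoring} scoring participants (expected at least ${minParticipants})`
    }
  }
  return { alert: false, message: null }
}
```

A round evaluated with empty scores, as in this incident, would yield `alert: true` because zero participants have a positive score.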